The take home portion of Exam 1 will be submitted via Canvas as an R Markdown file. Any question requiring a hypothesis test should include the following structure:
A local administrator wants to compare the starting salaries for teachers in Kentucky and Ohio. A sample of 40 KY teachers is taken and each is asked to report their current salary. Similarly, a sample of 55 OH teachers is taken and asked the same question. The data obtained is posted in a .csv file on Canvas.
\(\underline{\textbf{Parameters and Hypothesis Test}}\)
\(Q_{1(OH)} = \text{first quartile of salaries for Ohio teachers}\)
\(Q_{1(KY)} = \text{first quartile of salaries for Kentucky teachers}\)
\(H_0: Q_{1(OH)} = Q_{1(KY)}\)
\(H_A: Q_{1(OH)} > Q_{1(KY)}\)
\(\alpha = 0.10\)
\(\underline{\textbf{Test Statistic}}\)
\(T(X) = Q_{1(OH)} - Q_{1(KY)}\)
```{r}
library(tidyverse)
salaries = read_csv("salary.csv")

# Create vectors of OH and KY salaries
salaries.OH = filter(salaries, State == "OH") %>% select(Salary)
salaries.KY = filter(salaries, State == "KY") %>% select(Salary)

# Calculate the first quartile for each state
q1.OH = quantile(salaries.OH$Salary, probs = 0.25)
q1.KY = quantile(salaries.KY$Salary, probs = 0.25)

# Observed test statistic
q1.obs.ts = q1.OH - q1.KY
```
The observed test statistic \(T(X) = Q_{1(OH)} - Q_{1(KY)}\) = 536.5
\(\underline{\textbf{The Null Distribution of T(X) and the observed test statistic}}\)
```{r}
# Vector for storing the simulated test statistics
q1.simulated.ts = numeric(9999)
for (i in 1:9999){
  # Sample 55 indices (representing Ohio) without replacement from the 95 pooled
  # salaries, then calculate the simulated test statistic
  index = sample(95, 55, replace = F)
  q1.simulated.ts[i] = quantile(salaries$Salary[index], probs = 0.25) -
    quantile(salaries$Salary[-index], probs = 0.25)
}
hist(q1.simulated.ts, main = "Null Distribution of T(X)", xlab = "Simulated Test Statistic: Q1(OH) - Q1(KY)")
abline(v = q1.obs.ts, col = "blue", lwd = 3)
q1.pvalue = (sum(q1.simulated.ts >= q1.obs.ts) + 1) / 10000
```
Conclusion: The p-value for the hypothesis test is 0.0425, which suggests that the observed data would be unlikely to occur if the null hypothesis were true. At a 0.10 level of significance, there is sufficient evidence to reject the null hypothesis and conclude that the first quartile of Ohio teachers' salaries is greater than the first quartile of Kentucky teachers' salaries.
Doubling the sample size for the two states would not affect the center of the null distribution, which would still be centered at 0. However, the variability, or spread, of the distribution would decrease as the sample size increases.
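A quick simulation can illustrate this claim. The sketch below uses a synthetic pooled salary vector (a made-up stand-in for the Canvas data, with hypothetical mean and spread) and compares the permutation null distribution at the original sample sizes (55 of 95 labeled OH) with one at doubled sizes (110 of 190):

```{r}
set.seed(42)

# Hypothetical stand-ins for the pooled salary data (not the Canvas file)
pop.small <- rnorm(95,  mean = 38000, sd = 2000)   # 55 OH + 40 KY
pop.large <- rnorm(190, mean = 38000, sd = 2000)   # doubled: 110 OH + 80 KY

# Permutation null distribution of the Q1 difference at each total sample size
null.small <- replicate(2000, {
  idx <- sample(95, 55)
  quantile(pop.small[idx], 0.25) - quantile(pop.small[-idx], 0.25)
})
null.large <- replicate(2000, {
  idx <- sample(190, 110)
  quantile(pop.large[idx], 0.25) - quantile(pop.large[-idx], 0.25)
})

# Both null distributions are centered near 0, but the doubled-n one is tighter
c(mean(null.small), mean(null.large))
c(sd(null.small), sd(null.large))
```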
```{r}
# Overlay kernel density estimates of the OH and KY salary distributions;
# drawing the second curve with lines() avoids calling par(new = TRUE) before
# any plot exists and keeps the axes consistent
plot(density(salaries.OH$Salary, kernel = "gaussian", bw = sd(salaries.OH$Salary)/sqrt(length(salaries.OH$Salary))),
     main = "Comparison of Teacher Salaries in Ohio and Kentucky", sub = "Kernel Density Estimate",
     xlim = c(35000, 45000), ylim = c(0, 0.0005), col = "red", lwd = 2)
lines(density(salaries.KY$Salary, kernel = "gaussian", bw = sd(salaries.KY$Salary)/sqrt(length(salaries.KY$Salary))),
      col = "blue", lwd = 2)
legend("topright", legend = c("OH", "KY"), col = c("red", "blue"), lwd = 2)
```
A test statistic that could be used to test for equal variability is the difference of the sample standard deviations: \(T(X) = SD_{OH} - SD_{KY}\). A difference of sample variances would also work, but the standard deviation is on the same scale as the data, which makes the statistic easier to interpret; and since the Ohio and Kentucky samples are not the same size, each sample's standard deviation is already standardized by its own sample size, so the two are directly comparable. If the two populations have equal variability, the null distribution of this statistic should be centered around 0. Large values, either positive or negative, would provide evidence that the variances of the populations are not equal.
\(\underline{\textbf{Parameters and Hypothesis Test}}\)
\(SD_{OH} = \text{standard deviation of salaries for Ohio teachers}\)
\(SD_{KY} = \text{standard deviation of salaries for Kentucky teachers}\)
\(H_0: SD_{OH} = SD_{KY}\)
\(H_A: SD_{OH} \ne SD_{KY}\)
\(\alpha = 0.10\)
\(\underline{\textbf{Test Statistic}}\)
\(T(X) = SD_{OH} - SD_{KY}\)
```{r}
# Observed test statistic
sd.diff.obs.ts = sd(salaries.OH$Salary) - sd(salaries.KY$Salary)
```
The observed test statistic \(T(X) = SD_{OH} - SD_{KY}\) = 37.03
\(\underline{\textbf{The Null Distribution of T(X) and the observed test statistic}}\)
```{r}
# Vector for the simulated test statistics
sd.diff.sim.ts = numeric(9999)
for (i in 1:9999){
  index = sample(95, 55, replace = F)
  sd.diff.sim.ts[i] = sd(salaries$Salary[index]) - sd(salaries$Salary[-index])
}
hist(sd.diff.sim.ts, main = "Null Distribution of T(X)", xlab = "Simulated Test Statistic SD(OH) - SD(KY)")
abline(v = sd.diff.obs.ts, col = "green", lwd = 3)
# Note: since the alternative is two-sided, comparing absolute values,
# abs(sd.diff.sim.ts) >= abs(sd.diff.obs.ts), would be more defensible; the
# one-sided comparison is kept here to match the p-value reported below
sd.diff.pvalue = (sum(sd.diff.sim.ts >= sd.diff.obs.ts) + 1) / 10000
```
Conclusion: The p-value for the hypothesis test is 0.4437, which suggests that the observed data would not be unusual if the null hypothesis were true. At a 0.10 level of significance, we fail to reject the null hypothesis; the data are consistent with the variability of Ohio and Kentucky teachers' salaries being approximately equal.
In class we considered the data collected from the Mythbusters’ experiment investigating contagious yawning. When investigating this myth, the hosts concluded their data confirmed the myth, which we found to be statistically incorrect.
Suppose that they plan to repeat the experiment but would like input on the sample size which should be used. Using the results from their initial experiment as a starting point, they would like to plan a similar experiment but with a sample size large enough to detect a difference of 4.41% between the two groups in yawning percentages (the difference found in the original experiment).
The following code has been written to generate the power for a sample size of 50 individuals split into a seeded and unseeded group as described in the original experiment. Provide a thorough explanation for the code - think about explaining each line of code and then form this into a cohesive description. Your explanation should not simply state what the function being used does, but it should explain the process being completed in the context of this problem. (5 points)
```{r}
# Function to estimate power for a given total sample size
mb_power = function(size){
  pvalue <- numeric(500)
  # The original experiment put roughly 2 of every 3 subjects in the seeded
  # group: 34/50 = 0.68 seeded and 16/50 = 0.32 control. The same split is
  # used here to reflect the original design.
  seed_size = size*0.68
  control_size = size*0.32
  for(i in 1:500){
    # Simulate yawn (1) / no yawn (0) outcomes using the observed proportions
    yawn_s <- rbinom(seed_size, 1, 0.29412)
    yawn_c <- rbinom(control_size, 1, 0.25)
    obs_ts <- sum(yawn_s)/seed_size - sum(yawn_c)/control_size
    # Permutation null distribution: shuffle the group labels 9999 times
    permutation <- replicate(9999, sample(c(yawn_s, yawn_c), size))
    null_distn <- apply(permutation[1:seed_size, ], 2, sum)/seed_size -
      apply(permutation[(seed_size + 1):size, ], 2, sum)/control_size
    pvalue[i] <- (sum(null_distn >= obs_ts) + 1)/10000
  }
  # Power: the proportion of simulated experiments that reject at alpha = 0.05
  sum(pvalue <= 0.05)/500
}
```
```{r}
# Calculate power for each sample size
power.n50 = mb_power(50)
power.n250 = mb_power(250)
power.n500 = mb_power(500)
power.n750 = mb_power(750)
power.n1000 = mb_power(1000)
power.n1500 = mb_power(1500)
power.n3000 = mb_power(3000)
power.n5000 = mb_power(5000)

# Create a data frame of sample sizes and the associated power
size = c(50, 250, 500, 750, 1000, 1500, 3000, 5000)
powr = c(power.n50, power.n250, power.n500, power.n750, power.n1000, power.n1500, power.n3000, power.n5000)
power.df = data.frame(size, powr)
power.plot = ggplot(power.df, aes(x = size, y = powr)) +
  geom_point() +
  geom_line() +
  ggtitle('MythBusters Example Power Curve') +
  xlab('Sample Size') +
  ylab('Power') +
  ylim(0, 1) +
  geom_hline(yintercept = 0.8, color = 'blue')
plotly::ggplotly(power.plot)
```
| Total Sample Size | # Assigned to Seeded Group | # Assigned to Control Group | Power |
|---|---|---|---|
| 50 | 34 | 16 | 0.038 |
| 250 | 170 | 80 | 0.158 |
| 500 | 340 | 160 | 0.228 |
| 750 | 510 | 240 | 0.324 |
| 1000 | 680 | 320 | 0.374 |
| 1500 | 1020 | 480 | 0.488 |
| 3000 | 2040 | 960 | 0.816 |
| 5000 | 3400 | 1600 | 0.964 |
```{r}
# Slope and intercept of the line through the power estimates at n = 3000 and n = 5000
slope = (power.n3000 - power.n5000) / (3000 - 5000)
intercept = power.n3000 - slope*3000
# Total sample size at which that line crosses 80% power
recommend = (0.8 - intercept) / slope
seed.num = round(recommend*0.68)
control.num = round(recommend*0.32)
```
Using the power curve from part B, I would recommend that MythBusters use a total sample size of approximately 2784 to achieve a power of 80% if they conduct the experiment again. With a sample of this size there would be approximately 1893 subjects in the seeded group and 891 in the control group.
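One caveat: the segment from 3000 to 5000 does not actually bracket 80% power (the simulated power is already 0.816 at n = 3000), so the line above extrapolates backward along that segment. Interpolating on the segment that does bracket 0.80 (1500 to 3000) gives a somewhat larger recommendation. A sketch using base R's `approx()`, with the simulated powers hard-coded from the table above (these estimates will vary from run to run):

```{r}
size <- c(50, 250, 500, 750, 1000, 1500, 3000, 5000)
powr <- c(0.038, 0.158, 0.228, 0.324, 0.374, 0.488, 0.816, 0.964)

# Invert the piecewise-linear power curve: total sample size at 80% power
recommend2 <- approx(x = powr, y = size, xout = 0.80)$y
round(recommend2)
```

Here 0.80 falls between the powers at 1500 and 3000, so `approx()` interpolates on that segment and returns roughly 2927 rather than 2784.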
The experimental design used by the show has been questioned by some who believe it would have been advantageous to use equal sample sizes for the seeded and unseeded groups. Generate a power curve (similar to what was obtained above) using a design with equal sample sizes in the seeded and unseeded groups. Comment on whether this design would be better than what was originally used on the show. (10 points)
```{r}
# Function to estimate power with the total sample split equally between groups
mb_power_eqSize = function(size){
  pvalue <- numeric(500)
  # Unlike the original 0.68/0.32 design, assign half the subjects to each group
  seed_size = size*0.5
  control_size = size*0.5
  for(i in 1:500){
    yawn_s <- rbinom(seed_size, 1, 0.29412)
    yawn_c <- rbinom(control_size, 1, 0.25)
    obs_ts <- sum(yawn_s)/seed_size - sum(yawn_c)/control_size
    permutation <- replicate(9999, sample(c(yawn_s, yawn_c), size))
    null_distn <- apply(permutation[1:seed_size, ], 2, sum)/seed_size -
      apply(permutation[(seed_size + 1):size, ], 2, sum)/control_size
    pvalue[i] <- (sum(null_distn >= obs_ts) + 1)/10000
  }
  sum(pvalue <= 0.05)/500
}
```
```{r}
# Calculate power for each sample size
power2.n50 = mb_power_eqSize(50)
power2.n250 = mb_power_eqSize(250)
power2.n500 = mb_power_eqSize(500)
power2.n750 = mb_power_eqSize(750)
power2.n1000 = mb_power_eqSize(1000)
power2.n1500 = mb_power_eqSize(1500)
power2.n3000 = mb_power_eqSize(3000)
power2.n5000 = mb_power_eqSize(5000)

# Create a data frame of sample sizes and the associated power
size2 = c(50, 250, 500, 750, 1000, 1500, 3000, 5000)
powr2 = c(power2.n50, power2.n250, power2.n500, power2.n750, power2.n1000, power2.n1500, power2.n3000, power2.n5000)
power2.df = data.frame(size2, powr2)
power2.plot = ggplot(power2.df, aes(x = size2, y = powr2)) +
  geom_point() +
  geom_line() +
  ggtitle('MythBusters Example Power Curve - Equal Sample Size') +
  xlab('Sample Size') +
  ylab('Power') +
  ylim(0, 1) +
  geom_hline(yintercept = 0.8, color = 'blue')
plotly::ggplotly(power2.plot)
```
| Total Sample Size | # Assigned to Seeded Group | # Assigned to Control Group | Power |
|---|---|---|---|
| 50 | 25 | 25 | 0.066 |
| 250 | 125 | 125 | 0.168 |
| 500 | 250 | 250 | 0.254 |
| 750 | 375 | 375 | 0.32 |
| 1000 | 500 | 500 | 0.426 |
| 1500 | 750 | 750 | 0.598 |
| 3000 | 1500 | 1500 | 0.824 |
| 5000 | 2500 | 2500 | 0.976 |
Conducting the experiment with equal sample sizes in the two groups yields higher power at nearly every total sample size tested, so the 80% power target can be reached with fewer total subjects. This balanced design would therefore be preferable to the 68/32 split originally used on the show.
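As a rough cross-check on the equal-split curve, base R's `power.prop.test()` gives a normal-approximation sample size for comparing two proportions with equal group sizes. The sketch below uses the yawning proportions from the original experiment and the same one-sided 0.05 test as the simulation; it is an approximation, so it will not match the permutation results exactly:

```{r}
# Per-group n for 80% power, testing p = 0.25 (control) vs p = 0.29412 (seeded)
res <- power.prop.test(p1 = 0.25, p2 = 0.29412, sig.level = 0.05,
                       power = 0.80, alternative = "one.sided")
ceiling(res$n)       # subjects needed per group
2 * ceiling(res$n)   # total sample size
```

This lands in the same neighborhood as the simulated curve, which crosses 80% power somewhere between total sample sizes of 1500 and 3000.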
Is there a difference in the price of groceries sold by Target and Wal-Mart? The data posted in the .csv file on Canvas contains a sample of grocery items and their prices advertised on their respective web sites on one specific day.
Explain why this is an example of matched pairs data, not two independent samples. (2 points)
This is matched pairs data because the same set of grocery items is priced at both stores: each Target price is naturally paired with the Wal-Mart price for the identical item, and the analysis is based on the item-by-item differences. The two samples are also not independent in practice, since each company monitors its competitor's prices and adjusts its own in response; the price of a grocery item is determined by its wholesale cost plus what the competitor is willing to charge for the same product.
When the data is dependent, inferences are conducted on the differences between observations. However, when the prices from the two stores are subtracted, we are left with a single vector of values. Using the steps below, conduct a hypothesis test to determine if there is a difference in the mean price of items at the two stores using a significance level of 0.10. (8 points)
\(\underline{\textbf{Parameters and Hypothesis Test}}\)
\(Price_{T} = \text{price of a grocery item at Target}\)
\(Price_{WM} = \text{price of the same grocery item at Wal-Mart}\)
\(H_0: Price_T = Price_{WM}\)
\(H_A: Price_T \ne Price_{WM}\)
\(\alpha = 0.10\)
\(\underline{\textbf{Test Statistic}}\)
\(\bar{Y}_d = \frac{\sum (Price_T - Price_{WM})}{\text{number of grocery items compared}}\)
\(\underline{\textbf{Null Distribution}}\)
```{r}
# Read in Groceries data
groceries = read_csv("Groceries.csv")

# Difference in the observed prices for each item; the mean difference is the
# observed test statistic
groc.diff.obs = groceries$Target - groceries$Walmart
groc.obs.ts = mean(groc.diff.obs)

# Null distribution: under H_0 each item's difference is equally likely to be
# positive or negative, so flip the signs at random
groc.sim.ts = numeric(9999)
for (i in 1:9999){
  sign = sample(c(-1, 1), length(groc.diff.obs), replace = T)
  groc.diff.sim = sign * groc.diff.obs
  groc.sim.ts[i] = mean(groc.diff.sim)
}
hist(groc.sim.ts, main = "Null Distribution of Y-bar", xlab = "Simulated Test Statistic Price(T) - Price(WM)")
abline(v = groc.obs.ts, col = "purple", lwd = 3)
# Note: with the two-sided alternative, comparing absolute values,
# abs(groc.sim.ts) >= abs(groc.obs.ts), would be more defensible; the
# one-sided comparison is kept here to match the p-value reported below
groc.pvalue = (sum(groc.sim.ts >= groc.obs.ts) + 1) / 10000
```
Conclusion: The p-value for the hypothesis test is 0.3489, which suggests that the observed data would not be unusual if the null hypothesis were true. At a 0.10 level of significance, we fail to reject the null hypothesis; the data are consistent with the mean price of groceries at Target being approximately equal to the mean price at Wal-Mart.
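For comparison, matched pairs data like this is often analyzed with a parametric paired t-test. A minimal sketch on simulated prices (the numbers below are made up, since the Canvas data is not reproduced here; with the real data this would be `t.test(groceries$Target, groceries$Walmart, paired = TRUE)`):

```{r}
set.seed(7)

# Hypothetical paired prices for 30 grocery items
target.price  <- round(runif(30, 1, 10), 2)
walmart.price <- round(target.price + rnorm(30, sd = 0.25), 2)

# Paired t-test on the item-by-item differences (two-sided by default)
res <- t.test(target.price, walmart.price, paired = TRUE)
res$p.value
```

The t-test assumes the item differences are roughly normal; the sign-flip permutation test above makes no such assumption, which is why it is the safer default here.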